Suppose we have a report and we want to find the sentences in it that mention numerical quantities...
Originally inspired by "When you get data in sentences: how to use a spreadsheet to extract numbers from phrases" (Paul Bradshaw, Online Journalism Blog), from which some of the example sentences (sic!) are taken.
sentences = [
    '4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months',
    'No quantities here',
    'I measured it as 2 meters and 30 centimeters.',
    "four years and six months' imprisonment with a licence extension of 2 years and 6 months",
    'it cost £250... bargain...',
    'it weighs four hundred kilograms.',
    'It weighs 400kg.',
    'three million, two hundred & forty, you say?',
    'it weighs four hundred and twenty kilograms.'
]
quantulum3
quantulum3 is a Python package "for information extraction of quantities from unstructured text".
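#!pip3 install quantulum3
from quantulum3 import parser

for sent in sentences:
    print(sent)
    p = parser.parse(sent)
    if p:
        # Rewrite the sentence with each quantity spelled out in words
        print('\tSpoken:', parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)  # notebook display of the Quantity object
            # q.surface is the matched text; str(q) is the spoken form
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')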
4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months
    Spoken: four years and six months’ imprisonment with a licence extension of two years and six months
    Numeric elements:
Quantity(4, "Unit(name="year", entity=Entity("time"), uri=Year)")
        4 years :: four years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
        6 months :: six months
Quantity(2, "Unit(name="year", entity=Entity("time"), uri=Year)")
        2 years :: two years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
        6 months :: six months

---------

No quantities here

---------

I measured it as 2 meters and 30 centimeters.
    Spoken: I measured it as two metres and thirty centimetres.
    Numeric elements:
Quantity(2, "Unit(name="metre", entity=Entity("length"), uri=Metre)")
        2 meters :: two metres
Quantity(30, "Unit(name="centimetre", entity=Entity("length"), uri=Centimetre)")
        30 centimeters :: thirty centimetres

---------

four years and six months' imprisonment with a licence extension of 2 years and 6 months
    Spoken: four years and six months imprisonment with a licence extension of two years and six months
    Numeric elements:
Quantity(4, "Unit(name="year", entity=Entity("time"), uri=Year)")
        four years :: four years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
        six months' :: six months
Quantity(2, "Unit(name="year", entity=Entity("time"), uri=Year)")
        2 years :: two years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
        6 months :: six months

---------

it cost £250... bargain...
    Spoken: it cost two hundred and fifty pounds sterling, zero pence... bargain...
    Numeric elements:
Quantity(250, "Unit(name="pound sterling", entity=Entity("currency"), uri=Pound_sterling)")
        £250 :: two hundred and fifty pounds sterling, zero pence

---------

it weighs four hundred kilograms.
    Spoken: it weighs four hundred kilograms.
    Numeric elements:
Quantity(400, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
        four hundred kilograms :: four hundred kilograms

---------

It weighs 400kg.
    Spoken: It weighs four hundred kilograms.
    Numeric elements:
Quantity(400, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
        400kg :: four hundred kilograms

---------

three million, two hundred & forty, you say?
    Spoken: three million, two hundred & forty, you say?
    Numeric elements:
Quantity(3e+06, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")
        three million :: three million
Quantity(200, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")
        two hundred :: two hundred
Quantity(40, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")
        forty :: forty

---------

it weighs four hundred and twenty kilograms.
    Spoken: it weighs four hundred and twenty kilograms.
    Numeric elements:
Quantity(420, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
        four hundred and twenty kilograms :: four hundred and twenty kilograms

---------
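If all we want is a shortlist of the sentences that mention a quantity, rather than the full printout, the loop can be reduced to a filter. The following is a minimal sketch: `quantity_sentences` is a hypothetical helper, not part of quantulum3, and it takes the parsing function as an argument so that `parser.parse` can be passed in. The regex-based `digit_parse` is only a stand-in so that the example runs without quantulum3 installed; unlike quantulum3, it will not spot numbers written as words.

```python
import re


def quantity_sentences(sentences, parse):
    """Return (sentence, quantities) pairs for sentences where parse() finds something.

    `parse` is any callable that returns a (possibly empty) list for a string,
    for example quantulum3's parser.parse.
    """
    hits = []
    for sent in sentences:
        found = parse(sent)
        if found:
            hits.append((sent, found))
    return hits


def digit_parse(text):
    """Stand-in parser (an assumption, for illustration only): find runs of digits."""
    return re.findall(r'\d+(?:\.\d+)?', text)


examples = ['No quantities here', 'It weighs 400kg.', 'it cost £250... bargain...']
for sent, found in quantity_sentences(examples, digit_parse):
    print(sent, '->', found)
```

With quantulum3 available, `quantity_sentences(sentences, parser.parse)` should pick out the same sentences that the loop reports numeric elements for.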
If we have a large blob of text and want to skim it quickly for sentences that contain quantities, we can do something like the following...
import spacy
nlp = spacy.load('en_core_web_lg', disable = ['ner'])
text = '''
Once upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250.
It was blue. It took forty five minutes to get it home.
What a day that was. I didn't get back until 2.15pm. Then I had some cake for tea.
'''
doc = nlp(text)
for sent in doc.sents:
    print(sent)
Once upon a time, there was a thing.
The thing weighed forty kilogrammes and cost £250.
It was blue.
It took forty five minutes to get it home.
What a day that was.
I didn't get back until 2.15pm.
Then I had some cake for tea.
for sent in doc.sents:
    sent = sent.text  # quantulum3 expects a plain string, not a spaCy Span
    p = parser.parse(sent)
    if p:
        print('\tSpoken:', parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')
    Spoken: Once upon one instance, there was a thing.
    Numeric elements:
Quantity(1, "Unit(name="count", entity=Entity("dimensionless"), uri=Count_data)")
        a time :: one instance

---------

    Spoken: The thing weighed forty kilograms and cost two hundred and fifty pounds sterling, zero pence.
    Numeric elements:
Quantity(40, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
        forty kilogrammes :: forty kilograms
Quantity(250, "Unit(name="pound sterling", entity=Entity("currency"), uri=Pound_sterling)")
        £250 :: two hundred and fifty pounds sterling, zero pence

---------


---------

    Spoken: It took forty-five minutes to get it home.
    Numeric elements:
Quantity(45, "Unit(name="minute of arc", entity=Entity("angle"), uri=Minute_and_second_of_arc)")
        forty five minutes :: forty-five minutes

---------

    Spoken: What one day that was.
    Numeric elements:
Quantity(1, "Unit(name="day", entity=Entity("time"), uri=Day)")
        a day :: one day

---------

    Spoken: I didn't get back until two point one five picometres.
    Numeric elements:
Quantity(2.15, "Unit(name="picometre", entity=Entity("length"), uri=Picometre)")
        2.15pm :: two point one five picometres

---------


---------
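Not every match in that output is right: "forty five minutes" is reported as minutes of arc (an angle) and "2.15pm" as picometres (a length). One way to make such cases easier to catch is to separate quantities by the kind of thing they measure. The sketch below assumes each quantity exposes its entity name as `q.unit.entity.name`, which is what the `Unit(... entity=Entity("time") ...)` output suggests but which should be checked against the installed version. `split_by_entity`, `fake_quantity` and the `expected` set are all hypothetical, and the `SimpleNamespace` objects are stand-ins for real quantulum3 results so that the example runs on its own.

```python
from types import SimpleNamespace


def split_by_entity(quantities, expected):
    """Partition quantities into (kept, suspect) by their unit's entity name."""
    kept, suspect = [], []
    for q in quantities:
        if q.unit.entity.name in expected:
            kept.append(q)
        else:
            suspect.append(q)
    return kept, suspect


def fake_quantity(surface, value, unit_name, entity_name):
    """Build a stand-in with the attribute layout assumed for a parsed quantity."""
    unit = SimpleNamespace(name=unit_name, entity=SimpleNamespace(name=entity_name))
    return SimpleNamespace(surface=surface, value=value, unit=unit)


# Values copied from the output above, rebuilt as stand-in objects
parsed = [
    fake_quantity('forty kilogrammes', 40, 'kilogram', 'mass'),
    fake_quantity('forty five minutes', 45, 'minute of arc', 'angle'),
    fake_quantity('2.15pm', 2.15, 'picometre', 'length'),
]

# Entities we expect to see in this text (an assumption about the report)
expected = {'time', 'mass', 'currency'}

kept, suspect = split_by_entity(parsed, expected)
print('kept:', [q.surface for q in kept])
print('check by hand:', [q.surface for q in suspect])
```

The suspect list is something to review rather than discard: here it contains a genuine duration that was misread as an angle.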
Can we extract numbers from sentences in a CSV file? Yes we can...
url = 'https://raw.githubusercontent.com/BBC-Data-Unit/unduly-lenient-sentences/master/ULS+for+Sankey.csv'
import pandas as pd
df = pd.read_csv(url)
df.head()
| | Year | Offence category REFINED | Original sentence (refined) | Crown Court | Outcome of Decision | Revised? | People | Top 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | Drug offence | 3 years imprisonment | Bristol | Not referred | No | 1 | Y |
| 1 | 2015 | Death or serious injury - unlawful driving | 6 years imprisonment - Disqualified driving - ... | Portsmouth | Not referred | No | 1 | Y |
| 2 | 2015 | Sexual offence | 9 months imprisonment suspended for 2 years | Nottingham | Out of time | No | 1 | Y |
| 3 | 2015 | Theft offence | 4 years and 10 months imprisonment - consecuti... | St Albans | Not referred | No | 1 | Y |
| 4 | 2015 | Theft offence | unknown | unknown | Not in scheme | No | 1 | Y |
#get a row
df.iloc[1]
Year                           2015
Offence category REFINED       Death or serious injury - unlawful driving
Original sentence (refined)    6 years imprisonment - Disqualified driving - ...
Crown Court                    Portsmouth
Outcome of Decision            Not referred
Revised?                       No
People                         1
Top 7                          Y
Name: 1, dtype: object
#and a, erm. sentence...
df.iloc[1]['Original sentence (refined)']
'6 years imprisonment - Disqualified driving - 8 years'
parser.parse(df.iloc[1]['Original sentence (refined)'])
[Quantity(6, "Unit(name="year", entity=Entity("time"), uri=Year)"), Quantity(8, "Unit(name="year", entity=Entity("time"), uri=Year)")]
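def amountify(txt):
    """Return the quantities found in txt as a '::'-separated string."""
    try:
        if txt:
            p = parser.parse(txt)
            x = []
            for q in p:
                x.append('{} {}'.format(q.value, q.unit.name))
            return '::'.join(x)
        return ''
    except Exception:
        # Cells that cannot be parsed (e.g. NaN) are treated as having no amounts
        return ''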
df['amounts'] = df['Original sentence (refined)'].apply(amountify)
df.head()
| | Year | Offence category REFINED | Original sentence (refined) | Crown Court | Outcome of Decision | Revised? | People | Top 7 | amounts |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | Drug offence | 3 years imprisonment | Bristol | Not referred | No | 1 | Y | 3.0 year |
| 1 | 2015 | Death or serious injury - unlawful driving | 6 years imprisonment - Disqualified driving - ... | Portsmouth | Not referred | No | 1 | Y | 6.0 year::8.0 year |
| 2 | 2015 | Sexual offence | 9 months imprisonment suspended for 2 years | Nottingham | Out of time | No | 1 | Y | 9.0 month::2.0 year |
| 3 | 2015 | Theft offence | 4 years and 10 months imprisonment - consecuti... | St Albans | Not referred | No | 1 | Y | 4.0 year::10.0 month |
| 4 | 2015 | Theft offence | unknown | unknown | Not in scheme | No | 1 | Y | |
We could then do something to split multiple amounts into multiple rows or columns...
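As a starting point for that, the `amounts` strings can be unpacked back into value and unit pairs. This is a minimal sketch in plain Python: `split_amounts` and `to_long_rows` are hypothetical helpers, and they assume the `value unit` and `::` layout produced by `amountify`.

```python
def split_amounts(amounts):
    """Turn '6.0 year::8.0 year' into [(6.0, 'year'), (8.0, 'year')]."""
    if not isinstance(amounts, str) or not amounts:
        return []  # empty string, None or NaN: no amounts recorded
    pairs = []
    for item in amounts.split('::'):
        # Split on the first space only, so units such as 'pound sterling' survive
        value, unit = item.split(' ', 1)
        pairs.append((float(value), unit))
    return pairs


def to_long_rows(records, column='amounts'):
    """Expand a list of dicts into one dict per amount (a 'long' layout)."""
    rows = []
    for rec in records:
        for value, unit in split_amounts(rec.get(column)):
            rows.append({**rec, 'value': value, 'unit': unit})
    return rows


# Two illustrative records shaped like rows of the table above
records = [
    {'Crown Court': 'Portsmouth', 'amounts': '6.0 year::8.0 year'},
    {'Crown Court': 'unknown', 'amounts': ''},
]
for row in to_long_rows(records):
    print(row['Crown Court'], row['value'], row['unit'])
```

With the DataFrame from earlier, `df.to_dict('records')` would supply the input and `pd.DataFrame(to_long_rows(...))` would turn the result back into a table. Rows with no amounts are dropped in this version, which may or may not be what is wanted.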